First of all, import the datasets I obtained from ORFEO:
bcast_default.df<- read.csv(file = "bcast_default.csv")
bcast_linear.df<- read.csv(file = "bcast_linear.csv")
bcast_chain.df<- read.csv(file = "bcast_chain.csv")
bcast_binarytree.df<- read.csv(file = "bcast_binarytree.csv")
Now we can bind them into a single df:
bcast.df<- rbind(bcast_default.df,bcast_linear.df,bcast_chain.df,bcast_binarytree.df)
For the moment, let's fix the message size to 1 MPI_CHAR:
linear_map_plt<- bcast_linear.df %>% filter(MessageSize == 1) %>%
ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
geom_line(linetype = 1) +
geom_point(size=1)+
geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
scale_color_manual(values = c("#82CD47","#E67E22","#3498DB")) +
xlim(2, 48) + # Adjust the x-axis limits as needed
labs(x = 'N. of processes (#cores)',
y = 'Latency (μs)',
title = 'Latency vs #cores (MessageSize=1 MPI_CHAR)',
color = 'Allocation')+
theme_light()+
theme(
legend.position = "bottom",
legend.text = element_text(size = 12)
)
linear_map_plt
Comments:
As we did above for the linear broadcast, we fix the message size to 1 MPI_CHAR:
chain_map_plt<- bcast_chain.df %>% filter(MessageSize == 1) %>%
ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
geom_line(linetype = 1) +
geom_point(size=1)+
geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
scale_color_manual(values = c("#82CD47","#E67E22","#3498DB")) +
xlim(2, 48) + # Adjust the x-axis limits as needed
labs(x = 'N. of processes (#cores)',
y = 'Latency (μs)',
title = 'Latency vs #cores (MessageSize=1 MPI_CHAR)',
color = 'Allocation')+
theme_light()+
theme(
legend.position = "bottom",
legend.text = element_text(size = 12)
)
chain_map_plt
Comments:
- Same considerations as before.
- We can notice a huge difference w.r.t. linear when we cross a node boundary. For linear, when np >= 24 the three different allocations seem to converge, and this makes sense: linear sends from process 0 to all the others, so once processes sit in both nodes the same steps are executed regardless of the allocation. This is not the case for the chain algorithm, where a precise order is followed: if we distribute processes by node, the messages obviously travel longer distances than when allocating by core or socket, and this is evident in the plot.
- Differences between allocation by core and by socket are less evident (probably the gap between communication among cores in the same socket and cores in the same node is not that large; perhaps working within a single NUMA region is beneficial).
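To put a number on the allocation gap discussed above, here is a small dplyr sketch (assuming bcast_chain.df has the Processes, Allocation, Latency, and MessageSize columns used in the plots) that averages chain latency per allocation once both nodes are in use:

```r
# Sketch: mean chain-broadcast latency per allocation once both nodes
# are in use (np >= 24); assumes the columns used in the plots above.
library(dplyr)

bcast_chain.df %>%
  filter(MessageSize == 1, Processes >= 24) %>%
  group_by(Allocation) %>%
  summarise(mean_latency_us = mean(Latency), .groups = "drop") %>%
  arrange(mean_latency_us)
```

If the reasoning above holds, the node allocation should come last by a clear margin, with core and socket close together.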
binarytree_map_plt<- bcast_binarytree.df %>% filter(MessageSize == 1) %>%
ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
geom_line(linetype = 1) +
geom_point(size=1)+
geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
scale_color_manual(values = c("#82CD47","#E67E22","#3498DB")) +
xlim(2, 48) + # Adjust the x-axis limits as needed
labs(x = 'N. of processes (#cores)',
y = 'Latency (μs)',
title = 'Latency vs #cores (MessageSize=1 MPI_CHAR)',
color = 'Allocation')+
theme_light()+
theme(
legend.position = "bottom",
legend.text = element_text(size = 12)
)
binarytree_map_plt
bcast_default.df %>% filter(MessageSize == 1) %>%
ggplot(aes(x = Processes, y = Latency, color = Allocation)) +
geom_line(linetype = 1) +
geom_point(size=1)+
geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
scale_color_manual(values = c("#82CD47","#E67E22","#3498DB")) +
xlim(2, 48) + # Adjust the x-axis limits as needed
labs(x = 'N. of processes (#cores)',
y = 'Latency (μs)',
title = 'Latency vs #cores (MessageSize=1 MPI_CHAR)',
color = 'Allocation')+
theme_light()
For now, let's focus on the comparison between Linear and Chain:
bcast.df %>% filter(MessageSize == 1) %>% filter(Allocation == "core") %>% filter(Algorithm %in% c("Linear","Chain")) %>%
ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
geom_line(linetype = 1) +
geom_point(size=1)+
geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
scale_color_manual(values = c("#E67E22","#3498DB")) +
xlim(2, 48) + # Adjust the x-axis limits as needed
labs(x = 'N. of processes (#cores)',
y = 'Latency (μs)',
title = 'Broadcast Latency vs #cores (MessageSize = 1)',
color = 'Algorithm')+
theme_light()
Comments:
Remember that we are dealing with --map-by core
Region 1 (within socket): chain and linear perform roughly the same. This suggests that intra-socket communication (between cores in the same socket) is so fast that the choice of algorithm does not impact latency significantly (or the difference is imperceptible at the y-axis scale of the whole plot).
NB: here it might be interesting to show point-to-point latency (since we are working with a short message, we are measuring "pure" latency). From this observation we can then move to the plots with increased message size to show the difference in the algorithms' performance.
Region 2 (within node): chain and linear still perform roughly the same; perhaps intra-node communication (between cores, even in different sockets) is also fast enough that the difference is not evident. Still, there is a small gap showing chain with a lower latency. Reasoning about the implementation of chain, this makes sense: chain works along contiguous cores (while linear has more "core jumps"), and the paper states that the MPI chain implementation splits the message into chunks (which linear cannot do).
Region 3 (outside node): in this region the latency is obviously much higher, and the difference between the chain and linear algorithms is more evident. As explained before, this makes sense: chain works along contiguous cores, while the linear (flat) tree sends from rank 0 to all other ranks.
BONUS: in the plot above the difference between the algorithms was not very evident, but this may be due to the small message size. Even if the flat tree is less efficient than chain, sending such a small message (1 MPI_CHAR) makes both algorithms extremely fast.
Recall that starting with a small message was meant to exclude, as much as possible, the time needed for transmitting/storing the payload, and to focus on the "pure" latency.
But if we now increase the message size, the performance gap between the two algorithms becomes more evident:
Plots=list()
sizes=c(64,256,2048)
for(i in 1:length(sizes)){
# Filter by current size
current_plot<- bcast.df %>% filter(Allocation == "core") %>% filter(Algorithm %in% c("Linear","Chain")) %>% filter(MessageSize == sizes[i]) %>%
ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
geom_line(linetype = 1) +
geom_point(size=1)+
geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
annotate("text",x=c(12,24),y=c(0,0),label=c("12 Cores","24 cores"),vjust=0,hjust=-0.1,color="#616A6B")+
scale_color_manual(values = c("#E67E22","#3498DB")) +
xlim(2,48) + # Adjust the x-axis limits as needed
labs(x = 'N. of processes (#cores)',
y = expression('Latency (µs)'),
title = paste0("Broadcast Latency vs #cores (MessageSize =",sizes[i],")"),
color = 'Algorithm')+
theme_light()
Plots[[i]]<- current_plot
}
Plots[[1]] | Plots[[2]] | Plots[[3]] # place the three panels side by side (patchwork)
NB: the effect of message size can be commented on by noting why chain outperforms linear: chain allows for message partitioning, while linear does not. Eventually we can cite the reference.
Having noticed that allocation by core is associated with the lowest latency, let's fix Allocation to core and compare the algorithms behind the MPI broadcast operation.
Considering a message size of 1 MPI_CHAR we obtain:
bcast.df %>% filter(MessageSize == 1) %>% filter(Allocation == "core") %>%
ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
geom_line(linetype = 1) +
geom_point(size=1)+
geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
scale_color_manual(values = c("#27AE60","#E67E22","#8E44AD","#3498DB")) +
xlim(2, 48) + # Adjust the x-axis limits as needed
labs(x = 'N. of processes (#cores)',
y = 'Latency (μs)',
title = 'Broadcast Latency vs #cores (MessageSize = 1)',
color = 'Algorithm')+
theme_light()
Instead, if we let the size vary, we get something similar to what we saw for Linear vs Chain:
Plots=list()
sizes=c(64,256,2048)
for(i in 1:length(sizes)){
# Filter by current size
current_plot<- bcast.df %>% filter(Allocation == "core") %>% filter(MessageSize == sizes[i]) %>%
ggplot(aes(x = Processes, y = Latency, color = Algorithm)) +
geom_line(linetype = 1) +
geom_point(size=1)+
geom_vline(xintercept = c(12, 24), linetype = "dashed", color = "#616A6B")+
annotate("text", x = c(12,24), y = c(0,0), label = c("12 Cores","24 cores"), vjust = 0, hjust = -0.1,color="#616A6B")+
scale_color_manual(values = c("#27AE60","#E67E22","#8E44AD","#3498DB")) +
xlim(2, 48) + # Adjust the x-axis limits as needed
labs(x = 'N. of processes (#cores)',
y = expression('Latency (µs)'),
title = paste0("Broadcast Latency vs #cores (MessageSize =",sizes[i],")"),
color = 'Algorithm')+
theme_light()
Plots[[i]]<- current_plot
}
Plots[[1]] | Plots[[2]] | Plots[[3]]
bcast_linear.df %>%
filter(Allocation == "core") %>% filter(MessageSize %in% c(2,256,1024,2048,16384)) %>%
ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
geom_line(linetype=1) +
geom_point() +
labs(
x = "Processes",
y = "Latency (μs)",
title = "Latency vs Processes for Linear algorithm (Allocation='core')",
color = "Message Size"
) +
xlim(c(2,48))+
theme_light()
Note: I did NOT consider all the message sizes, since latency grows roughly exponentially with message size. We will treat this more carefully when building a model.
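Given that roughly exponential growth, a log-scaled y-axis makes all the message sizes readable in a single panel. A minimal variation of the plot above (same assumed columns):

```r
# Same linear-broadcast view, but with a log10 y-axis so that small and
# large message sizes fit on one readable panel.
bcast_linear.df %>%
  filter(Allocation == "core") %>%
  ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
  geom_line() +
  geom_point(size = 1) +
  scale_y_log10() +
  labs(x = "Processes", y = "Latency (μs, log scale)",
       title = "Linear broadcast, all message sizes (Allocation = 'core')",
       color = "Message Size") +
  theme_light()
```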
bcast_chain.df %>%
filter(Allocation == "core") %>% filter(MessageSize %in% c(2,256,2048,4096,8192)) %>%
ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
geom_line(linetype=1) +
geom_point(size=1) +
labs(
x = "Processes",
y = "Latency (μs)",
title = "Latency vs Processes for Chain algorithm (Allocation='core')",
color = "Message Size"
) +
xlim(c(2,48))+
theme_light()
bcast_binarytree.df %>%
filter(Allocation == "core") %>% filter(MessageSize %in% c(2,256,2048,4096,8192)) %>%
ggplot(aes(x = Processes, y = Latency, color = factor(MessageSize))) +
geom_line(linetype=1) +
geom_point(size=1) +
labs(
x = "Processes",
y = "Latency (μs)",
title = "Latency vs Processes for BinaryTree algorithm (Allocation='core')",
color = "Message Size"
) +
xlim(c(2,48))+
theme_light()
Now we fix the allocation to core, since it was the most performant allocation for all the algorithms. Our aim is to fit a performance model able to predict latency from the data.
Disclaimer: it may be a bias from my statistical background, but I was not able to implement the latency models explained in the paper.
First of all, let’s plot the variables to spot some relations:
bcast_linear.df %>%
filter(Allocation == "core") %>%
plot_ly(
x = ~MessageSize,
y = ~Processes,
z = ~Latency
)%>%
add_markers(
marker = list(color = ~Latency,
line = list(color = "black",width = 0.8 ))
) %>%
layout(
scene = list(
xaxis = list(title = "MessageSize"),
yaxis = list(title = "Processes"),
zaxis = list(title = "Latency")
),
title = "Latency vs Processes and MessageSize for Linear algorithm (Allocation='core')"
)
We can implement a simple linear model (it makes sense to remove the intercept)
fit.linear<- lm(formula= Latency ~ -1+MessageSize+Processes,
data=bcast_linear.df %>% filter(Allocation == "core")
)
summary(fit.linear)
##
## Call:
## lm(formula = Latency ~ -1 + MessageSize + Processes, data = bcast_linear.df %>%
## filter(Allocation == "core"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1133.70 -32.02 -20.19 -9.89 650.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## MessageSize 1.136e-03 2.454e-05 46.294 < 2e-16 ***
## Processes 1.083e+00 2.269e-01 4.775 2.36e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 137.4 on 502 degrees of freedom
## Multiple R-squared: 0.838, Adjusted R-squared: 0.8373
## F-statistic: 1298 on 2 and 502 DF, p-value: < 2.2e-16
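As a reminder of what the -1 in the formula does, here is a tiny self-contained example on synthetic data (not the benchmark): it drops the intercept and forces the fit through the origin.

```r
# -1 (equivalently +0) in an lm formula removes the intercept term.
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20, sd = 0.1)   # true slope 2, zero intercept

with_int    <- lm(y ~ x)           # intercept + slope: 2 coefficients
without_int <- lm(y ~ -1 + x)      # slope only: 1 coefficient

length(coef(with_int))             # 2
length(coef(without_int))          # 1
```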
Let’s take a marginal look at our response variable:
hist(bcast_linear.df$Latency,breaks = 70,
xlab="Latency(us)",
main="Histogram of Latency (Linear Broadcast)")
EVIDENT SKEWNESS! Let’s apply a log2 transformation (in the model I will apply it to both Latency and MessageSize):
hist(log2(bcast_linear.df$Latency),breaks = 40,
xlab="log2(Latency(us))",
main="Histogram of log2 Latency (Linear Broadcast)")
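To see why the log transform helps, here is a self-contained sketch on synthetic log-normal data (not the benchmark data): a right-skewed sample becomes roughly symmetric after log2.

```r
# A sample that is normal on the log2 scale is strongly right-skewed
# on the original scale; log2 undoes that skew.
set.seed(42)
x <- 2^rnorm(1000, mean = 5, sd = 2)   # right-skewed, like the latencies

mean(x) > median(x)                    # TRUE: classic sign of right skew

y <- log2(x)
abs(mean(y) - median(y))               # small: y is roughly symmetric
```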
bcast_linear.df %>%
filter(Allocation == "core") %>%
plot_ly(
x = ~log2(MessageSize),
y = ~Processes,
z = ~log2(Latency)
)%>%
add_markers(
marker = list(color = ~log2(Latency),
line = list(color = "black",width = 0.8 ))
) %>%
layout(
scene = list(
xaxis = list(title = "log2(MessageSize)"),
yaxis = list(title = "Processes"),
zaxis = list(title = "log2(Latency)")
),
title = "log2(Latency) vs Processes and log2(MessageSize) for Linear algorithm (Allocation='core')"
)
The linear model becomes:
fit.linear.log<- lm(formula= log2(Latency) ~ -1+Processes+log2(MessageSize),
data=bcast_linear.df %>% filter(Allocation == "core")
)
summary(fit.linear.log)
##
## Call:
## lm(formula = log2(Latency) ~ -1 + Processes + log2(MessageSize),
## data = bcast_linear.df %>% filter(Allocation == "core"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4898 -1.1118 -0.5450 0.3623 3.6810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Processes 0.029815 0.003808 7.83 2.92e-14 ***
## log2(MessageSize) 0.321259 0.009308 34.51 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.62 on 502 degrees of freedom
## Multiple R-squared: 0.8824, Adjusted R-squared: 0.8819
## F-statistic: 1883 on 2 and 502 DF, p-value: < 2.2e-16
Comment: \(R^2_{adj}\) increases from 83.7% to 88.2%!
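The comparison can be read directly off the fitted objects defined above (with the caveat that R² values for models with different response scales, Latency vs log2(Latency), are only indicatively comparable):

```r
# Adjusted R^2 of the raw-scale and log-scale linear-broadcast models.
summary(fit.linear)$adj.r.squared      # ~0.837 in the output above
summary(fit.linear.log)$adj.r.squared  # ~0.882 in the output above
```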
bcast_chain.df %>%
filter(Allocation == "core") %>%
plot_ly(
x = ~log2(MessageSize),
y = ~Processes,
z = ~log2(Latency)
)%>%
add_markers(
marker = list(color = ~log2(Latency),
line = list(color = "black",width = 0.8 ))
) %>%
layout(
scene = list(
xaxis = list(title = "log2(MessageSize)"),
yaxis = list(title = "Processes"),
zaxis = list(title = "log2(Latency)")
),
title = "log2(Latency) vs Processes and log2(MessageSize) for Chain algorithm (Allocation='core')"
)
fit.chain.log<- lm(formula= log2(Latency) ~ -1+Processes+log2(MessageSize),
data=bcast_chain.df %>% filter(Allocation == "core")
)
summary(fit.chain.log)
##
## Call:
## lm(formula = log2(Latency) ~ -1 + Processes + log2(MessageSize),
## data = bcast_chain.df %>% filter(Allocation == "core"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3286 -1.3089 -0.5489 0.6052 3.1535
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Processes 0.021957 0.003739 5.873 7.8e-09 ***
## log2(MessageSize) 0.306713 0.009139 33.560 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.591 on 502 degrees of freedom
## Multiple R-squared: 0.8683, Adjusted R-squared: 0.8677
## F-statistic: 1654 on 2 and 502 DF, p-value: < 2.2e-16
bcast_binarytree.df %>%
filter(Allocation == "core") %>%
plot_ly(
x = ~log2(MessageSize),
y = ~Processes,
z = ~log2(Latency)
)%>%
add_markers(
marker = list(color = ~log2(Latency),
line = list(color = "black",width = 0.8 ))
) %>%
layout(
scene = list(
xaxis = list(title = "log2(MessageSize)"),
yaxis = list(title = "Processes"),
zaxis = list(title = "log2(Latency)")
),
title = "log2(Latency) vs Processes and log2(MessageSize) for BinaryTree algorithm (Allocation='core')"
)
fit.binarytree.log<- lm(formula= log2(Latency) ~ -1+Processes+log2(MessageSize),
data=bcast_binarytree.df %>% filter(Allocation == "core")
)
summary(fit.binarytree.log)
##
## Call:
## lm(formula = log2(Latency) ~ -1 + Processes + log2(MessageSize),
## data = bcast_binarytree.df %>% filter(Allocation == "core"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4247 -1.3128 -0.6351 0.5242 3.6288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Processes 0.016315 0.003766 4.332 1.79e-05 ***
## log2(MessageSize) 0.317449 0.009207 34.480 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.603 on 502 degrees of freedom
## Multiple R-squared: 0.8663, Adjusted R-squared: 0.8658
## F-statistic: 1626 on 2 and 502 DF, p-value: < 2.2e-16
fit.default.log<- lm(formula= log2(Latency) ~ -1+Processes+log2(MessageSize),
data=bcast_default.df %>% filter(Allocation == "core")
)
summary(fit.default.log)
##
## Call:
## lm(formula = log2(Latency) ~ -1 + Processes + log2(MessageSize),
## data = bcast_default.df %>% filter(Allocation == "core"))
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4849 -1.2933 -0.5753 0.5509 3.2011
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Processes 0.009586 0.003781 2.535 0.0115 *
## log2(MessageSize) 0.321418 0.009242 34.778 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.609 on 502 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8586
## F-statistic: 1532 on 2 and 502 DF, p-value: < 2.2e-16
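To wrap up, the four log-scale fits can be collected into one small comparison table (reusing the fit.*.log objects defined above):

```r
# Per-process slope and adjusted R^2 for each log-scale model.
fits <- list(Default = fit.default.log, Linear = fit.linear.log,
             Chain = fit.chain.log, BinaryTree = fit.binarytree.log)

data.frame(
  Algorithm = names(fits),
  Proc_coef = sapply(fits, function(f) coef(f)[["Processes"]]),
  Adj_R2    = sapply(fits, function(f) summary(f)$adj.r.squared),
  row.names = NULL
)
```

Based on the summaries above, the per-process slope should order as Linear > Chain > BinaryTree > Default, matching the intuition that the flat tree scales worst with the number of processes.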